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Abstract 

This paper exploits a simplified version of the mixture of multivariate t-factor analyzers 
(MtFA) for robust mixture modelling and clustering of high-dimensional data that frequently 
contain a number of outliers. Two classes of eight parsimonious t mixture models are in- 
troduced and computation of maximum likelihood estimates of parameters is achieved using 
the alternating expectation conditional maximization (AECM) algorithm. The usefulness of 
the methodology is illustrated through applications of image compression and compact facial 
representation. 
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1 Introduction 

The mixture of factor analyzers (MFA) model, first introduced by Ghahramani and Hinton (1997) 
and further generalized by McLachlan and Peel (2000a) , has provided a flexible dimensionality re- 
duction approach to the statistical modelling of high-dimensional data arising from a wide variety of 
random phenomena. By combining the factor analysis model with finite mixture models, the MFA 
model allows simultaneous partitioning of the population into several subclasses while performing 
a local dimensionality reduction for each mixture component. In recent years, the study of MFA 
has received considerable interest; see McLachlan and Peel (2000b) and Fokoue and Titterington 
(2003) for some excellent reviews. McLachlan et al. (2002) and McLachlan et al. (2003) exploited 
the MFA approach to handle high-dimensional data such as clustering of microarray expression 
profiles. McNicholas and Murphy (2008) generalized the MFA model by introducing a family of 
parsimonious Gaussian mixture models (PGMMs). 

In the MFA framework, component errors and factors are routinely assumed to have a Gaussian 
distribution due to their mathematical tractability and computational convenience. In practice, 
noise components or badly discrepant outliers often exist; the multivariate t (MVT) distribution 
contains an additional tuning parameter, the degrees of freedom (df), which can be useful for 



•Corresponding author. E-mail: tilin@amath.nchu.cdu.tw; tel.:+886 4 22850420; fax: +886 4 22873028. 



1 



outlier accommodation. Specifically, the density of a g-componcnt multivariate t mixture model is 



where the Wi are mixing proportions and i p (y|/ij, Sj, Vi) is the density of p-variate t distribution 
with df Vi, mean vector and scaling covariance matrix Sj. Note that the general t mixture 
(TMIX) model (1) has a total of (g — 1) + gp + gp(p + l)/2 + g free parameters, of which gp(p + 
l)/2 + g parameters correspond to the component matrices 5^ and df vi, respectively. In a more 
parsimonious version of (1), wherein the X.; and are restricted to be identical across groups, the 
number of free parameters reduces to (g — 1) + gp + g(p + 1) + p(p + l)/2. 

The TMIX model was first considered by McLachlan and Peel (1998) and Peel and McLachlan 
(2000), who presented expectation-maximization (EM) algorithms for parameter estimation and 
show the robustness of the model for clustering. Further developments along these directions 
followed, including work by Shoham (2002); Lin et al. (2004, 2009); Wang et al. (2004); Greselin 
and Ingrassia (2010); Andrews et al. (2011). An extension to mixture of i-factor analyzers (MtFA), 
adopting the family of MVT distributions for the component factor and errors, was first considered 
by McLachlan et al. (2007) and more recently by Andrews and McNicholas (2011a,b). 

In this paper, our objective is to illustrate the efficacy of a class of TMIX models with sev- 
eral parsimonious covariance structures, called parsimonious t mixture models (PTMMs). The 
PTMMs are based on assuming a constrained f-factor structure on each mixture component for 
parsimoniously modelling population heterogeneity in the presence of fat tails; the PTMMs are 
^-analogues of the PGMMs. We are devoted to developing additional tools for a simplified version 
of the MODt family of Andrews and McNicholas (2011b) and applying the proposed tools on image 
reconstruction tasks. 

The EM algorithm (Dempster et al., 1977) and its extensions, such as the expectation condi- 
tional maximization (ECM) algorithm (Meng and Rubin, 1993) and the expectation conditional 
maximization either (ECME) algorithm (Liu and Rubin, 1994, 1995), have been practiced as useful 
tools for conducting maximum likelihood (ML) estimation in a variety of mixture modelling sce- 
narios. To improve the computational efficiency for fitting PTMMs, we adopt a three-cycle AECM 
algorithm (Meng and van Dyk, 1997) that allows specification of different complete data at each 
cycle. 

The rest of the paper is organized as follows. In Section 2, we briefly describe the single t 
factor analysis model and study some related properties. In Section 3, we present the formulation 
of PTMM and discuss the methods for fitting these models. In Section 4, we demonstrate how 
PTMM can be applied to image compression and compact facial representation tasks. Some 
concluding remarks are given in Section 5. 
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2 The t factor analysis model 



Wc briefly review the t factor analysis (tFA) model, which can be thought of as a single component 
MtFA model. Let Y = (Y\, . . . , Y n ) be a set of p-dimensional random vectors. In the tFA setting, 



each Yj is modelled as 



with 



Yj , = n + IW, + Bj 



(2) 



U; 



L q+P 
















where fi is a p-dimensional mean vector, B is a p x q matrix of factor loadings, Uj is a g-dimensional 
(q <C p) vector of latent variables called common factors, Sj is ^dimensional vector of errors called 
specific factors, l q is a g-dimensional identity matrix, \& = diag{-0i, . . . ,i/j p } is a diagonal matrix 
with positive entries, and v is the df. 

Note that Uj and Sj are uncorrelated and marginally i-distributed, but not independent. Using 
the characterization of the t distribution, model (2) can be hierarchically represented as 



U 4 I T4 



N q (0,TfX), 
Gamma(y/2, is/2), 



where Gamma(a, 0) stands for the gamma distribution with mean a/ ft. It follows that the marginal 
distribution of Yj, after integrating out Uj and Tj, can be expressed as Yj ~ t p (fi, BB T + \&, v). 

When handling high-dimensional data for large p relative to n, say p 3> n, the inverse of 
BB T + ^ plays an important role in computational complexity. In such a case, we use the 
following matrix inversion formula (Woodbury, 1950): 



(BB T + = * 1 - * _1 B(I 9 + B T *" 1 B) 1 B T *" 1 



(3) 



which can be done more quickly because it involves only the low-dimensional q x q inverse plus the 
inversion of a diagonal p x p matrix. Moreover, the determinant of BB T + \& can be calculated as 



|BB T + *| = [I, + B T *- 1 B| JJ^< 



(4) 



i=l 



The formulae (3) and (4) have been used many times before, including work by McLachlan and 
Peel (2000a) and McNicholas and Murphy (2008). 



3 Parsimonious multivariate t mixture models 

Consider n independent p-dimensional random vectors Yi, . . . , Y n that come from a heterogeneous 
population with g non-overlapping classes. Suppose that the density function of each feature Yj 
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can be modelled by a g-component MtFA: 

f{Y j | 0) =J2^it P (YM, B i B J + (5) 

i=l 

where the iu, are mixing proportions that are constrained to be positive and 2f=i = 1> ^» 
is a p x (j matrix of factor loadings, and S&i = diagj^ii, ■ • ■ , ipip} is a diagonal matrix. We 
use = (wi, . . . , w g , 0i, . . . , g ) to represent all unknown parameters with 0i = (ft, Bi, , i/j) 
containing the parameters for the density of component i. 

As in McLachlan et al. (2007), the MtFA model can be alternatively formulated by exploiting 
its link with the tFA model. Specifically, we assume that 

= ft + B iUij + £ij 

with probability Wi, for i — l,...,g and j = 1, ...,n, where Uij and are, respectively, the 
common and specific factors corresponding to component i. We assume that Uij and £y have a 
joint MVT distribution to be consistent with model (5) for the marginal distribution of Yj. Under 
a hierarchical mixture modelling framework, each Yj is conceptualized to have originated from one 
of g classes. It is convenient to construct unobserved allocation variables Zj = (Z%j, . . . , Z g j) T , 
for j = l,...,n, whose values are a set of binary variables with Zij = 1 indexing so that Yj 
belongs to class i and are constrained to be J2i=i Zij = 1 for each j. More specifically, Zj 
is distributed as a multinomial random vector with one trial and cell probabilities wi, . . . ,w g , 
denoted by Zj ~ A4(l; Wi, . . . , w g ). After a little algebra (cf. McLachlan et al., 2007), we have 

Uij | (Y j ,z ij = i)^t q (rJ(Y j -n i ),^^-n i ,u + p), (6) 

where = (B !; B^ + ^ l )- 1 B l , a, = l q - if B, and 6 tJ = (Yj - ^(B.Bj + ^^(Yj - ft) 
denotes the Mahalanobis-squared distance between Yj and /x^ 

Following McNicholas and Murphy (2008), we extend the MtFA model by allowing constraints 
across groups on matrices B^ and \Ev Specifically, we allow the following constraints on the scaling 
covariance matrices: B^ = B; \&i = or \&i = ipj. p , for i = 1, . . . , g. In addition, we consider 
the constraint v$ = v (i = l,...,g) when we impose this constraint to reduce the number of 
parameter in PTMMs, our family of models (Table 1) is called 'PTMM1', and when the Ui are 
allowed to vary across components, we call our family 'PTMM2'. In this latter case (PTMM2), 
one simply adds g — 1 parameters to the last column of Table 1. The number of free covariance 
parameters in the PTMM family — the union of PTMM1 and PTMM2 — can be reduced to as 
few as qp — q(q — l)/2 + 2 or inflated to as many as g[pq — q(q — l)/2 +p + 1]. 

The implementation of an efficient 3-cycle AECM algorithm for estimating PTMMs together 
with some practical issues including the specification of starting values, the stopping rule and the 
model selection criterion are described in Supplementary Material. 
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Table 1: The eight covariance structures, used by McNicholas and Murphy (2008), that form the 
basis of the PTMM considered herein. The "Constrained" in the columns "Loading" , "Error" , and 
"Isotropic" represent = B, \&i = SI/ and ipik = tpi for i = 1, . . . , g and k = 1, . . . , k, respectively. 



Structure 


Loading 


Error 


Isotropic 


Number of parameters 


CCC 


Constrained 


Constrained 


Constrained 


(pq - q(q 


- l)/2) + 1 + 1 


ecu 


Constrained 


Constrained 


Unconstrained 


{pq - q(q 


- l)/2)+p+l 


cue 


Constrained 


Unconstrained 


Constrained 


(pq - q(q 


- l)/2) + fl +l 


CUU 


Constrained 


Unconstrained 


Unconstrained 


(pq - q(q - 


l)/2)+flp + l 


UCC 


Unconstrained 


Constrained 


Constrained 


g(pq - q(q 


- l)/2) + 1 + 1 


UCU 


Unconstrained 


Constrained 


Unconstrained 


g(pq - q(q 


- l)/2)+p+l 


UUC 


Unconstrained 


Unconstrained 


Constrained 


g(pq - q(q 


- l)/2)+fl + l 


uuu 


Unconstrained 


Unconstrained 


Unconstrained 


g(pq - q(q - 


l)/2)+pp+l 



4 Applications 

4.1 Image compression 

A colour image is a digital image that includes 24-bit RGB colour information for each pixel. An 
RGB image is derived from the three primary colours — red (R), green (G), and blue (B) — for 
which each colour is 8 bits. The technique of image compression plays a central role in image 
transmission and storage for high quality. There are several unsupervised image compression 
approaches based on probabilistic models. The MFA model is well recognized as a dominant 
dimension reduction technique and is useful in block image transform coding (Ucda et al., 2000). 
In this subsection, we individually apply the PTMMs and PGMM to colour image compression and 
compare the quality of the reconstructed images. A 512 x 512 RGB colour image 'Lena' (encoded 
in 24 bits per pixel) is subdivided into n = 16384 non-overlapping RGB-blocks of 4 x 4 x 3 pixels 
and each block is taken as a 48-dimensional data vector y. Let y be a set of collected y vector. 
Making a slight modification of Ueda et al. (2000), our compression procedure to transform the 
experimental image is summarized below. 

1. Set the desired number of components g and dimensionality of factors q. 

2. Estimate © by fitting a parsimonious mixture model to y. 

3. Perform a model-based clustering according to the component membership of data point y, 

which is decided by maximizing the posterior probability Pr(Z,j = l\y). 

4. For each y^ G y classified to Cj, calculate 

Vj = fri+ViUij, (7) 

„ T » - T 

where Uij — B 4 (BjBj + i f?i)~ 1 (Yj — fa) is the estimated posterior mean of (6). 

5. Reconstruct jjj, j = 1, . . . , n, into a compressed color image. 
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Table 2: Comparisons of fitting adequacy and quality of image reconstruction using three CUU 
models with (g = 4; q = 4) and (g = 8; q = 8). 



( 9 = 4; 9 = 4) 


Model 


fmax(xl0 3 ) 


m 


BIC(xl0 3 ) 


RMSE 


PSNR 


PGMM 


2104.4 


189 


4206.8 


7.62 


30.5 


PTMM1 


2205.9 


190 


4409.7 


5.22 


33.8 


PTMM2 


2204.6 


193 


4407.1 


5.26 


33.7 






(9 = i 


I; ? = 8) 






Model 


J?max(xl0 3 ) 


m 


BIC(xl0 3 ) 


RMSE 


PSNR 


PGMM 


2235.1 


363 


4466.3 


7.03 


31.2 


PTMM1 


2348.8 


364 


4693.7 


2.78 


39.2 


PTMM2 


2351.2 


371 


4698.5 


2.43 


40.4 



BIC = 2£ max — mlogn. Models with large BIC scores are preferred. 



In what follows, we assumed that the characteristic of each components in the mixture model 
are the same, which leads to the MFA model with common factor loadings (CUU structure). The 
Lena image data are fitted by the CUU structure with (g = 4, q = 4) and (g = 8, q = 8) using the 
PGMM and PTMM approaches. To evaluate the quality of the reconstructed image, we compute 
the root mean squared error (RMSE) and peak signal-to-noise ratio (PSNR), which are widely 
used metrics in the image coding field. For an original color image {In(x,y), Ia{x,y), Ib{x,u)} of 
size M x N and a reconstructed image {Ir(x , y) , Ig{x, y) , Ib(x, y)} , the RMSE is defined as 



RMSE = 




{MSE R + MS E G + MSE B ), 



where 

M N 

MSE R = — £ £ [l R (x, y) - I R (x, y)} \ 

and the other two quantities MSEq and MSEb are defined in the same fashion. The measure 
RMSE represents the average Euclidean distance from the original image to the output one. For 
a 24-bit colour [0, 255] 3 image, the PSNR is given by 

/ o^S \ 

psnr = 201o HrW- 

Note that the higher PSNR value indicates a compressed (i.e., reconstructed) image of better 
quality. 

Experimental results are provided in Table 2, including the log-likelihoods, the number of 
parameters, and BIC values, together with the RMSE and PSNR for the compressed images. The 
performance of PTMM1 and PTMM2 are almost the same in terms of RMSE and PSNR for the 
compressed images. Notably, the BIC and PSNR values obtained from the PTMM approaches are 
much higher than the PGMM for all cases. 

Figure 1 shows the original Lena image and three compressed images obtained by fitted the 
CUU (g = 8; q = 8) models. The quality of the reconstructed images using the PTMMs is much 
better (i.e., smoother) than those reconstructed using the PGMM (for instance, observe the edge 
of Lena's shoulder). 
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Original FGMM 




Figure 1: An example of Lena colour image reconstruction using three CUU models with 3 = 8 
and q = 8. 

4.2 Compact facial representation 

Over the past few decades, numerous template matching approaches, along with their modifications 
and variants, have been proposed for human perception system face recognition; see Zhao et al. 
(2003) for an extensive survey. Motivated by a technique developed by Sirovitch and Kirby (1987), 
the eigenface method (Turk and Pentland, 1991) has become the most popular tool for facial feature 
extraction in computer vision research. Its implementation is primarily based on the use of PCA, 
also known as the Karhunen-Loeve transform. The central idea of PCA is to find a multilinear 
subspace whose orthonormal basis maximizes the scatter matrix of all the projected samples. 

Although the eigenface method is remarkably simple and quite straightforward, its performance 
depends heavily on pose, lighting and background conditions. Because PCA is a one-sample 
method, a single set of eigenfaces may not be enough to represent face images having a large 
amount of variation. Some efforts have been made to generalize it to the mixture-of-eigenfaces 
method Kim et al. (2002), which generates more than one set of eigenfaces for better representation 
of the whole face. However, such a method will produce an over-parameterized solution in many 
circumstances. To capture the heterogeneity among the underlying face images while maintaining 
parsimony in modeling, we apply the PTMM approach with common factor loadings (CUU) as a 
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novel strategy for robust facial representation. 

Consider a training set of n face images Ii, . . . ,I n . Suppose that all images are exactly the 
same size; a pixels in width by b in height. Let y = {yj = vec(/j)}™ =1 be the transformed image 
vectors taking intensity values in a p-dimcnsional image space, where p = a x b. In what follows, 
we briefly review the eigenface method. Define the total scatter matrix of the sample images as 

n 

where y is the mean image vector of all samples. In PCA, the optimal orthonormal basis vectors 
J~ipcA S R pxK , where K « p, are chosen to maximize the following objective function 

Epca = [ei • • • e K ] = argmax_g|£; T ST-E|, 

where {ek}k=i is the set of p-dimensional vectors of St corresponding to the K largest eigenvectors, 
which are referred to as eigenfaces in Turk and Pentland (1991) because they have exactly the 
same dimension as the original images. Each face image vector y i can be represented as a linear 
combination of the best K eigenvectors: 

K 

k=l 

where u jk = eJiVj - v)- 

As another illustration, we compare the face reconstruction performance among the PGMM 
and PTMM approaches using the image compression algorithm described in the above example 
and the eigenface method. We use the face images in the Yale face database, which contains 165 
grayscale images of 15 individuals in GIF format. There are 11 images per person with pose and 
lighting variations, but for illustrative purposes and simplicity we consider only 11 images of one 
individual. For each image, we manually select the centers of eyes, denoted by {x,y). The four 
boundaries of each cropped central part of the face are x — 60, x + 60, y — 100, and y + 60, which 
gives images of 121 x 161 pixels as a sample in our experiment. 

In this experiment, the 11 face images are trained by seven models: the PCA, PGMM, and 
PTMM1 models with CUU structure, where K = q = 5 and g = 1,2, 3. The reconstructed images 
can be obtained by following two steps. 

1. In each model, the fitted vector is obtained as a reconstructed image, say y^. For the PCA 

method, y j = yJ CA as defined in (8). As for the PGMM and PTMM1, the solution is 
obtained by using (7). 

2. If the entries of ijj do not fall in the color space, such as [0,1], the reproductions y* are 

normalized by 

%r-min{y-} 

Vjr= Fl • r- t for r = l,...,p. 

■> max{y}-mm{y} 
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Figure 2: Reconstructed images based on PCA, PGMM and PTMM approaches. The first column 
shows sample images for one individual of the Yale database. 

Figure 2 shows the original and the reconstructed images obtained from the seven training 
models. PCA clearly has the worst performance, while the PGMM and PTMM lead to somewhat 
comparable results. For explicitly measuring the reconstruction quality, we further calculated the 
RMSE, denoted by [{y } - y*) T {y 3 - y*)] 1/2 , for each image and each model. Figure 3 displays 
the patterns of RMSE values. When comparing these models, smaller RMSEs indicate better 
reconstructions. In general, PTMM1 yields lower RMSE values than those from PGMM. As a 
result, we conclude that PTMM1 can be a prominent tool for facial coding. 
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Figure 3: RMSE values of seven trained models for 11 face images. 



5 Concluding remarks 

Wc have utilized a class of parsimonious t mixture models, called the PTMMs, which may create 
tremendous flexibility in robust clustering of high-dimensional data as they are relatively insensi- 
tive to outliers. This model-based tool allows practitioners to analyze heterogeneous multivariate 
data in a broad variety of considerations and works particularly well in high-dimensional settings. 
Numerical results show that the proposed PTMM approach performs reasonably well for the ex- 
perimental image data. As pointed by by Zhao and Yu (2008) and Wang and Lin (2013), the 
convergence of the AECM algorithm can be painfully slow in certain situations. It is a worthwhile 
task to pursue some modified algorithms toward fast convergence. 
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Appendix 



A: ML estimation via the AECM algorithm 

We discuss how to carry out the AECM algorithm for computing the ML estimates of the 
parameters in PTMMs. To formulate the algorithm which consists of three cycles, we partition the 
unknown parameters © as (©1, ©2, ©3), where ©i contains w^s, ©2 contains fi^s and i/j's, while 
©3 contains Bj's and S&i's. Let Z = (Zi, . . . , Z n ) be the collection of latent group labels. In the 
first cycle of the algorithm, we treat Z as missing data. So the complete data is 3^aug = 
and our aim is to estimate ©1 with ©2 and ©3 fixed at their current estimates ©2 and ©3. 

The lc 
is given by 



The log-likelihood of ©1 based on the complete data ^aiig> a P ar t horn an additive constant, 



n g 



4 1] (eiinig) = ]T5> 4 ,iog W4 . (9) 

3=1 <=1 



Conditioning on Y and 0, taking expectation for (9) leads to 



g n 



qW(©i|6) = J2J2^ lo § w * ( 10 ) 

i=l j=l 



with 

Wit p (Yj I Ai, B A + &i,e>i) 



A « T 



Maximizing (10) with respect to Wi, that restricted to w i = yields 



n 

At the second cycle, when estimating ©2, we treat r and Z as the missing data. The log- 
likelihood function of ©2 based on the complete data ^aug = C^aug, T ) takes the form of 

n g -. 

4 2l (e2|^k 2 U g) = ^2J2z ij {--r j (Y j - fii f(B i B i +^i)- 1 (Y j -^) 

j=i i=i 

^og(f)-iogr(f) + |(io gTj - Tj )}. (ii) 



2 

Conditioning on Y and 0, taking expectation for (11) leads to 

g n 



Q[ 2 1(0 2 |0) = -i£^%r u -(r j - / x J ) T (B l B^ + *0" 1 (^-M i ) 

i=i j=i 

g n 

E E % [7 Job (f ) - lo g r (?) + t ( *« ~ M (12) 



»=i 3=1 

with 



and K y = DC — — - log — - 



5ij 
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where 5 i3 = (Y 3 - // i ) T (B J B i + * i )" 1 (Y' j - /tj and DG(ck) = r'(a)/r(a) is the digamma 
function. 

Maximizing (12) with respect to \i i and vi yields 

V™ Z--T--Y- 



En ~ 
j—l z ij T ij 



and 



Vi = argmax„. i — log —J - logr^yj + — 
which is equivalcntly the solution of the following equation 



X)j=l Zij^Kij T~ij) 



log 



DG 



1 + — ; = 0. 



In the case of V\ 



v„ = we obtain v as the solution of the following equation: 



l0 H2J- DG l2) +1 



0. 



At the third cycle, when estimating ©3, we treat U, r and Z as the missing data. The 
log-likelihood function of ©3 based on the complete data la U g = (^aug; U) takes the form of 

g n 



r 1 1 

i=l j=l 

^(/^B, 1 *, 'B,(/„}. (13) 



Therefore, the expectation of (13) conditioning on the observed data Y and the updated vales of 
is 

a -, a 



)[3] 



!©3|©) = ^nilogl^ril-I^^trj^Si-^B^Si 



where : 



1 S?=i %'%C*j _ Ai)C*j - Ai) T and Hi = Ig-f^Bj. The resulting 



CM-steps for the updates of B^ and \&i under eight constrained/unconstrained situations are given 
in the next Section. Note that the above procedure is also applicable for the tFA model by treating 



Zij = 1. 



B: Calculations for eight parsimonious models 

To facilitate the derivation, we adopt the following notation: 

S = J2 ^ " ilY. Wij {Yj - ^){Y 3 fi.f 

i—1 i—1 j=l 
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B.l: Model CCC 

For model CCC, we have = B, = \J f = ipl p . The Q^'-function can be expressed as 
Q[2] = !^pi g^-i _ J tr {s} - 2tr{Bf T S} + tr{B(0 + f T Sf )B T }] , 

where f = (BB T + # P ) _1 B and Cl = I q - B T f . Differentiating Q[ 2 ](0 2 |0) with respect to B 
and respectively, and set the derivations equal to zero, we obtain 



1 



B = Sr(n + r i Sr)- i and v=-tr{s-Br S} 



B.2: Model CCU 

For model CCU, we have B^ = B, \&; = \&. The -function can be expressed as 

tr{¥ _1 S} - 2tr{*" 1 Bf T S} + tr{* _1 B(fi + f T Sf )B T } 



g[ 2 ](0 2 |©) = ^iog|*-Vf 



where f = (BB + *) _1 B. Differentiating Q^(& 2 \&) with respect to B and respectively, 
and set the results equal to zero, we obtain 



B = Sf(n + f T Sf)- 1 and * = Diag{S - Bf T S}. 



B.3: Model CUC 

For model CUC, we have B^ — B, \&i = ipil p . The -function can be expressed as 

Q[2] [p 1o § IC'I - ^tiiSi] + 2^ 1 tr{BffS i } - ^hviBifl, + f^f i)B T } 

i=l 

where f* = (BB T + ^I p ) -1 B and ti t =I q - fjB. Differentiating <2 [2l (0 2 |0) with respect to B 
and ipT 1 > respectively, and set the results equal to zero, we obtain 



B 



,1=1 v ' 



-I -1 



and 



i = itr{S, - 2-BtJSi + B{£li + f JS t t l )B, T }. 



B.4: Model CUU 

For model CUU wc have B; = B. The -function can be expressed as 

q[2] = £ | [log |*r*| - tr{*ri Sl } + 2tv{*- 1 BTjS t } - tr^Jjfo + f ?S,f i )fl T } 
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where t { = (BB + ^i)' 1 ^ and toi = I q - fjB. Differentiating Q[ 2 1(© 2 |0) with respect to B 
and \E r 7 ^ 1 , respectively, and set the results equal to zero, we have 

= diagjS, - 2Bf fSi + B(A< + f ^S.f 4 )B T }. 

However, there is no closed form for the loading matrix B. One can solve it through a row-by-row 
manner. Set 

B( pX(? ) = &i — bfc ■■■b p 

where bk = [bki bk2 • • • &fcg] represents the fcth row of the matrix B. Let R = X)f=i ^i^T ^iTi 
and r/, be hth row of the matrix R. The /ith row of B can be expressed as 

a 

b h J2^(n i + f?s i r i ) = r h , 

i=l i>i(h) 

where ^(h) denotes the hth element along the diagonal of Hence, 

-l 



bh = r h 



.1=1 



, for /i = 1, . . . , jj. 



B.5: Model UCC 

For model UCC we have \&i = ipl p . The Q' 2 ' -function can be expressed as 

9 n 

Q l2] =Ef [P l °Z l^" 1 ! " V'-'trlSJ + 2^" 1 tr{B l f - V^B^ + f^f^} 



where t t = (BjB^ + ■0I p )" 1 B i and O., = I q - tjBi. Differentiating Q[ 2 1(0 2 |0) with respect to 
Bj and respectively, and set the results equal to zero, we obtain 

t 1 9 

B, = SiTiiSli + T { SiTi)- 1 and i/> = - V m^S, - B^f ^S,;}. 

P ~1 



B.6: Model UCU 

For model UCU we have SP, = \&. The -function can be expressed as 

9 

Q l2] = £ y [log - tri*" 1 ^} + 2tr{*- 1 B 4 f TS,} - tr^^B^ + tJsfijBj} 



where f 8 = (B t B, + *) J B t and O, = I, f i B,. Differentiating Q[ 2 1(© 2 |0) with respect to B, 
and respectively, and set the results equal to zero, we obtain 



B i = S<f' i (n < +f7s i f , < )- 1 and * = ^]u; 4 Diag{S l -B I f^S l }. 
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B.7: Model UUC 

For model UUC we have St; = ^I p . The -function can be expressed as 

Q[2] = V± [pbg ^r 1 ! - 4rhr{S t } + 2^- 1 tr{B ?; f JS t } - ^{B^ + f T^f OB?} 

i=l 

where I\ = (BjBj + -0 i Ip)" 1 B i and O, = I, - T, Bj. Differentiating (5[ 2 1(0 2 |0) with respect to 
Bi and ; respectively, and set the results equal to zero, we obtain 

B t = SiTifa + tJSiti)- 1 and & = - tr{S, - Bif ?Si}. 

P 

B. 8: Model UUU 

For model UUU, there are no constraints, so that 

Q[2] = £ V± [i g I*" 1 1 - tr{* i - 1 S < } + 2tr{*r 1 B i f TSJ - tr^B^A, + f Af^} 

i=l 

Differentiating Q[ 2 1(©2|©) with respect to B^ and VP^ 1 , respectively, and set the results equal to 
zero, we obtain 

B t = Sitiidi + tJSiti)- 1 and = Diag{S ? - B 4 f J SJ. 

C: Some computational issues 

C. l: Specification of initial values 

We investigate the issue of getting admissible initial values for the implementation of the 
AECM algorithm. Essentially, good initial values of parameter estimates may speed up the con- 
vergence to the global maximum. A simple procedure of automatically constructing a set of initial 
values is outlined below. 

1. Perform the fc-means algorithm initialized with a random seed. Subtract each observation from 

its initial cluster means. Then, do a single tFA fit to these "centering samples" . The resulting 
ML estimates 6 = (B, Sf?, ip, v) are taken as initial values of constrained parameters, namely 
B = B for the common factor loading restriction, = for the homoscedastic error 
covariance restriction, and V^ ' = tp for the isotropic restriction. The starting values for the 
dfs are set as v^' = 50, for i = 1, . . . , g, corresponding to an initial assumption of PGMM. 

2. Initialize the zero-one membership indicator Zj ^ = {zf^} 9 i=l according to the fc-mcans cluster- 

ing result. The initial values for the mixing proportions and component locations are then 
given by 

En _(0) ~(°) 

w i ~ z ancl Pi - — ~^r- 



En 
.7 = 1 
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3. Divide the data into g groups corresponding to Zj (j = l,...,n). Compute the sample 
variance-covariance matrix, namely V \, of each group. Following McNicholas and Murphy 
(2008), the initial values for the unconstrained are set as 

B (0 = 



Aiieii • • • \ Xi q e 



where V Xi r is the rth largest eigenvalue of V i and e^ r is the corresponding eigenvector. 
Straightforwardly, the initial values for and tpi are respectively taken as 

^trm-Bf^}. 
P 

In practice, multiple modes typically exist on the complete-data log-likelihood surface. Thus, 
the algorithm needs to be initialized with a variety of starting values. This can be done by 
performing fc- means clustering with various random seeds. 

C.2: Stopping rule 

To assess the convergence of the algorithm, various stopping criteria have been proposed in 
the literature. The most commonly used stopping criteria are based on lack-of-progress in the 
log-likelihood or parameter estimates and such criteria are not bona fide convergence criteria. 

We apply the Aitken acceleration scheme to determine the convergence of each AECM algo- 
rithm. The Aitken's acceleration at iteration k is defined by 

l(k+i) _ j(k) 



o<*> = 



/(fc) _ l(k-i) ' 



* (fc) 

where for brevity of notation means the log- likelihood value evaluated at . The asymptotic 
estimate of the log-likelihood at iteration + 1 is given by 

rifc+u = ^ + _L"(fc+i) _ Z (fc)). 

1 — aw 

The algorithm is claimed to have reached convergence when |Zoo — \ < e, where e is the desired 
tolerance. Unless otherwise stated, e = 10~ 5 herein. 

C.3: Model selection 

The widely used Bayesian information criterion can be used to choose the best member of the 
PTMM family and the number of factors. Herein, the BIC is defined as 

BIC = 2£ max - mlogn, 

where £ max is the maximized log-likelihood and m is the number of free parameters in the model. 
Accordingly, models with large BIC scores are preferred. 
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